Scraping

This is a simple script that downloads files referenced as links in an HTML file. I use it to scrape PDFs from Coursera.org course pages.

(c) Florian Hoppe 2013


In [36]:
import urlparse
import urllib  # Python 2: provides urlopen() and unquote()
import sys
import os
import threading
from lxml import etree

def _download_link(url,target_dir):
    urlpath = urlparse.urlsplit(url)
    if url.startswith("https"):
        print("Attention: link is a HTTPS URL. Might cause problems.")
#         url = "http" + url[5:]
    filename = target_dir + "/" + urllib.unquote(urlpath[2])[1:].replace("/","_")
    
    print("loading " + url + " and saving to " + filename)
    if os.path.exists(filename):
        print(filename + " already exists. Download is aborted.")
    else:
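        try:
            # Fetch first so that a failed request does not leave an empty file behind.
            data = urllib.urlopen(url).read()
            with open(filename, "wb") as f:
                f.write(data)
            print("Done with " + filename)
        except IOError as err:
            # err is an exception object, so convert it before concatenating.
            print("IO Error for " + url + "\nDue to " + str(err))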
        

def download_all(filename, link_filter='pdf'):
    """Searches the given source file for links and downloads their targets.

    Args:
        filename: path to an HTML file, relative to the current working directory
        link_filter: extension of the files that should be downloaded, e.g. 'pdf'
    Returns:
        Nothing. The link targets are saved in a subdirectory named after the
        extension plus "s", e.g. "pdfs" (must exist before calling the function).
    Raises:
        nothing
    """
    
    parser = etree.HTMLParser()
    tree = etree.parse(filename, parser)

    links = tree.xpath('//a')
    
    for link in links:
        if 'href' in link.attrib:
            url = link.attrib['href']
            
            # Compare against the URL path only, so a query string such as "?raw=true" is ignored.
            if urlparse.urlparse(url).path.endswith('.' + link_filter):
                t = threading.Thread(target=_download_link,args=(url,link_filter + 's'))
                t.start()
#                _download_link(url,link_filter + 's')  # sequential alternative
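
The cell above selects links by the extension of the URL path and derives the local file name from that path. The following is a minimal sketch of those two steps in isolation, not the notebook's own code. The helper names `has_extension` and `local_filename` are hypothetical, the `endswith` check is one possible way to test the extension, and the import fallback is only there so the sketch runs on both Python 2 and Python 3.

```python
try:  # Python 3
    from urllib.parse import urlsplit, unquote
except ImportError:  # Python 2, as used in the notebook
    from urlparse import urlsplit
    from urllib import unquote


def has_extension(url, extension):
    """Return True if the URL path (query string excluded) ends with '.<extension>'."""
    return urlsplit(url).path.endswith("." + extension)


def local_filename(url, target_dir):
    """Flatten the URL path into a single file name inside target_dir."""
    path = unquote(urlsplit(url).path)       # decode escapes such as %20
    flat = path[1:].replace("/", "_")        # drop the leading "/" and flatten
    return target_dir + "/" + flat


url = ("https://github.com/bcaffo/courses/blob/master/"
       "09_DevelopingDataProducts/lectures/shiny.pdf?raw=true")
print(has_extension(url, "pdf"))
print(local_filename(url, "pdfs"))
```

Because only the path component is inspected, the trailing `?raw=true` does not affect either the filter or the resulting file name.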

In [12]:
cd /Users/florianhoppe/Documents/SkyDrive/Coding/python/scrapy_tutorial


[Errno 2] No such file or directory: '/Users/florianhoppe/Documents/SkyDrive/Coding/python/scrapy_tutorial'
/Users/florianhoppe/Documents/Onedrive/Wissen/MOOCs/Practical ML

In [2]:
cd /Users/florianhoppe/Documents/Onedrive/Wissen/MOOCs/


/Users/florianhoppe/Documents/Onedrive/Wissen/MOOCs

In [4]:
cd Practical\ ML


/Users/florianhoppe/Documents/Onedrive/Wissen/MOOCs/Practical ML

In [20]:
cd ../Developing\ Data\ Products


/Users/florianhoppe/Documents/Onedrive/Wissen/MOOCs/Developing Data Products

In [21]:
mkdir pdfs
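
The target directory has to exist before `download_all` is called, which is why it is created by hand here. As a hedged alternative, the directory could be created from Python. The helper name `ensure_dir` is hypothetical, and the temporary directory is used only so the sketch does not touch the current working directory.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (including parents) if it does not exist yet, then return it."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


# Demonstration inside a throwaway directory.
base = tempfile.mkdtemp()
target = ensure_dir(os.path.join(base, "pdfs"))
print(os.path.isdir(target))
```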

In [37]:
download_all('Coursera.html')


Attention: link is a HTTPS URL. Might cause problems.
loading https://github.com/bcaffo/courses/blob/master/09_DevelopingDataProducts/lectures/shiny.pdf?raw=true and saving to pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_shiny.pdf
Attention: link is a HTTPS URL. Might cause problems.
loading https://github.com/bcaffo/courses/blob/master/09_DevelopingDataProducts/lectures/shiny2.pdf?raw=true and saving to pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_shiny2.pdf
Attention: link is a HTTPS URL. Might cause problems.
loading https://github.com/bcaffo/courses/blob/master/09_DevelopingDataProducts/lectures/manipulate.pdf?raw=true and saving to pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_manipulate.pdf
Attention: link is a HTTPS URL. Might cause problems.
loading https://github.com/bcaffo/courses/blob/master/09_DevelopingDataProducts/lectures/rCharts.pdf?raw=true and saving to pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_rCharts.pdf
Attention: link is a HTTPS URL. Might cause problems.
loading https://github.com/bcaffo/courses/blob/master/09_DevelopingDataProducts/lectures/googleVis.pdf?raw=true and saving to pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_googleVis.pdf
Attention: link is a HTTPS URL. Might cause problems.
loading https://github.com/bcaffo/courses/blob/master/09_DevelopingDataProducts/lectures/slidify.pdf?raw=true and saving to pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_slidify.pdf
Attention: link is a HTTPS URL. Might cause problems.
loading https://github.com/bcaffo/courses/blob/master/09_DevelopingDataProducts/lectures/RPackages.pdf?raw=true and saving to pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_RPackages.pdf
Attention: link is a HTTPS URL. Might cause problems.
loading https://github.com/bcaffo/courses/blob/master/09_DevelopingDataProducts/lectures/classes-methods.pdf?raw=true and saving to pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_classes-methods.pdf
Done with pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_shiny2.pdf
Done with pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_manipulate.pdf
Done with pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_RPackages.pdf
Done with pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_classes-methods.pdf
Done with pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_shiny.pdf
Done with pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_googleVis.pdf
Done with pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_rCharts.pdf
Done with pdfs/bcaffo_courses_blob_master_09_DevelopingDataProducts_lectures_slidify.pdf
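
The "Done with" lines arrive in a different order than the "loading" lines because every download runs in its own thread and `download_all` returns without waiting for them. If the caller needs to know when all downloads have finished, the threads can be collected and joined. The sketch below illustrates that idea under stated assumptions: `run_in_threads` and `fake_download` are hypothetical names, and `fake_download` merely stands in for `_download_link` so that the example runs without network access.

```python
import threading


def run_in_threads(worker, jobs):
    """Start one thread per job, then block until every thread has finished."""
    threads = [threading.Thread(target=worker, args=job) for job in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


finished = []
lock = threading.Lock()


def fake_download(url, target_dir):
    """Stand-in for _download_link: records the file name it would write."""
    with lock:
        finished.append(target_dir + "/" + url.rsplit("/", 1)[-1])


jobs = [("http://example.com/a.pdf", "pdfs"),
        ("http://example.com/b.pdf", "pdfs")]
run_in_threads(fake_download, jobs)
print(sorted(finished))
```

Once `run_in_threads` returns, all workers have completed, so any follow-up step (for example listing the downloaded files) can rely on the results being present.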
